In [1]:
from __future__ import print_function
from __future__ import division
In [2]:
import numpy as np
Normally, one would want a very generalized way of reading in output files (like an argparse input argument with nargs='+' whose entries get looped over in a big loop), but this is more to demonstrate the parsing of this specific kind of file, so we use a single hardcoded path.
In [3]:
outputfilepath = "../qm_files/drop_0001_1qm_2mm_eda_covp.out"
Normally, we might also do this, where we read the contents of the entire file in as a string. This might be a bad idea for these files, since they can grow to several megabytes.
In [4]:
# with open(outputfilepath) as outputfile:
#     raw_contents = outputfile.read()
It's more efficient to loop over the file directly, which avoids having to read the whole thing into memory. This does mean that you can't open and close it right away; you add another level of indentation.
In [5]:
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if "Total energy in the final basis" in line:
            print(line, end='')
Actually, it might be instructive to do a timing comparison between the two approaches.
In [6]:
searchstr = "Total energy in the final basis"
In [7]:
%%timeit -n 1000 -r 10
counter = 0
with open(outputfilepath) as outputfile:
    raw_contents = outputfile.read()
for line in iter(raw_contents.splitlines()):
    if searchstr in line:
        counter += 1
In [8]:
%%timeit -n 1000 -r 10
counter = 0
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            counter += 1
It looks like there's a very slight time penalty the second way; memory-efficient algorithms often require a bit more CPU time. The second way also looks a little cleaner, and it's easier to understand what's going on.
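To put a rough number on the memory side of that tradeoff, sys.getsizeof can compare the whole-file string against the largest single line. This sketch builds its own small throwaway file (the contents are made up) so it runs standalone:

```python
import sys
import tempfile

# Build a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(mode='w', suffix='.out', delete=False) as tmp:
    for i in range(10000):
        tmp.write("line {} of some fake output\n".format(i))
    path = tmp.name

# Reading everything at once keeps the full contents in memory...
with open(path) as f:
    raw_contents = f.read()
print(sys.getsizeof(raw_contents))

# ...while iterating only ever holds one line at a time.
with open(path) as f:
    biggest_line = max(sys.getsizeof(line) for line in f)
print(biggest_line)
```

The exact sizes depend on the file, but the gap grows linearly with the number of lines.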
Let's change the string we're looking for to one that's more relevant to the EDA/COVP analysis.
In [9]:
searchstr = "Energy decomposition of the delocalization term, kJ/mol"
In [10]:
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            print(line, end='')
That's fine, but we also want some of the lines that come after the header.
In [11]:
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            # print 10 lines instead
            for _ in range(10):
                print(line, end='')
                line = next(outputfile)
Here we've used two tricks:
Just because we can define variables in loops (like when range() or zip() or enumerate() are used) doesn't mean we need to use them in the loop body. When it doesn't matter what the loop variable is called, you'll often see _ used as its name: something still needs to be assigned for the syntax to work, but the underscore signals that nothing reads it.
Any open file whose handle you have access to (called outputfile in the above example), or anything that can be wrapped with iter() to make it an iterator, can have the next() function called on it to return the next item. In the case of files, you iterate over the lines one by one (separated by newlines). That's why the statement reads for line in outputfile:, where outputfile is the iterator and line is the variable that holds whatever the latest item from it is.
To learn more about iterators, there's the official documentation, and I found this Stack Overflow post: http://stackoverflow.com/questions/9884132/what-exactly-are-pythons-iterator-iterable-and-iteration-protocols
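As a tiny illustration of the protocol, here's the same next()/for interplay on a plain list instead of a file:

```python
# Wrapping an iterable in iter() gives an iterator; next() pulls one item at a time.
lines = ["first\n", "second\n", "third\n"]
iterator = iter(lines)
print(next(iterator), end='')  # first
print(next(iterator), end='')  # second
# A for loop over the same iterator picks up exactly where next() left off:
for line in iterator:
    print(line, end='')  # third
```

This mixing of explicit next() calls with a for loop over the same iterator is exactly what the parsing code above relies on.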
Usually, we don't specify a set number of extra lines to iterate, because that number isn't fixed. Instead, we parse until we hit some other line that's a good stopping point. Here is the full block we're interested in, plus the start of the other one for some context:
*-------------------------------------------------------------------*
* Energy decomposition of the delocalization term, kJ/mol *
*-------------------------------------------------------------------*
DEL from fragment(row) to fragment(col)
-------------------------------------------------------------------
1 2
1 0.00000 -9.01048
2 -6.09647 -0.00000
---------------------------------------------------------------------
---------------------------------------------------------------------
* Charge transfer analysis *
* R.Z.Khaliullin, A.T. Bell, M.Head-Gordon *
* J. Chem. Phys., 2008, 128, 184112 *
*-------------------------------------------------------------------*
The "variable" part of parsing here is the number of rows and columns between the two lines of dashes that come after DEL from.... That's the line we should really be searching for, since it's unique in the output file, and it's closer to the lines we want to extract.
Here's the idea.
In [12]:
searchstr = "DEL from fragment(row) to fragment(col)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            # This is now the line with the dashes.
            line = next(outputfile)
            # This is now the line with the column indices.
            line = next(outputfile)
            # Skip again to get the first line we want to parse.
            line = next(outputfile)
            # This ensures the parsing will terminate once the block is over.
            while list(set(line.strip())) != ['-']:
                print(line, end='')
                line = next(outputfile)
Now we're printing the correct rows. How should we store these values? It's probably best to put them in a NumPy array, but since that array needs to be allocated beforehand, we need to know the shape (which is the number of fragments). How do we get that?
In [13]:
searchstr_num_fragments = "SCF on fragment 1 out of"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr_num_fragments in line:
            nfragments = int(line.split()[-1])
print(nfragments)
Now, combine the two and place a NumPy array allocation in the middle.
The last tricky bit will be assigning the text/string values to array elements. We're going to use the slicing syntax for both the NumPy array and the string we're splitting.
In [14]:
searchstr_num_fragments = "SCF on fragment 1 out of"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr_num_fragments in line:
            nfragments = int(line.split()[-1])

# create an empty array (where we don't initialize the elements to 0, 1, or anything else)
fragment_del_energies = np.empty(shape=(nfragments, nfragments))

searchstr = "DEL from fragment(row) to fragment(col)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            line = next(outputfile)
            line = next(outputfile)
            line = next(outputfile)
            # We need to keep track of our row index with a counter, because we can't
            # use enumerate with a while loop.
            # We need to keep track of our row index in the first place because we're
            # indexing into a NumPy array.
            row_counter = 0
            while list(set(line.strip())) != ['-']:
                # 'map' float() onto every element of the list
                # map() returns an iterator, so turn it back into a list
                sline = list(map(float, line.split()[1:]))
                # set all columns in a given row at once
                fragment_del_energies[row_counter, :] = sline
                line = next(outputfile)
                row_counter += 1
In [15]:
print(fragment_del_energies)
It's probably a good idea to turn these into functions, so they can be run on an arbitrary calculation.
In [16]:
def get_num_fragments(outputfilepath):
    """Given a path to an output file, figure out how many fragments are part of it.
    """
    searchstr_num_fragments = "SCF on fragment 1 out of"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr_num_fragments in line:
                nfragments = int(line.split()[-1])
    return nfragments
In [17]:
def get_eda_fragment_delocalization_energies(outputfilepath, nfragments):
    """Given a path to an output file and the number of fragments it contains, return the
    delocalization energies between fragments.
    """
    fragment_del_energies = np.empty(shape=(nfragments, nfragments))
    searchstr = "DEL from fragment(row) to fragment(col)"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr in line:
                line = next(outputfile)
                line = next(outputfile)
                line = next(outputfile)
                row_counter = 0
                while list(set(line.strip())) != ['-']:
                    sline = list(map(float, line.split()[1:]))
                    fragment_del_energies[row_counter, :] = sline
                    line = next(outputfile)
                    row_counter += 1
    return fragment_del_energies
Now let's use it:
In [18]:
nfragments = get_num_fragments(outputfilepath)
fragment_del_energies = get_eda_fragment_delocalization_energies(outputfilepath, nfragments)
print(fragment_del_energies)
We can write something almost identical for the decomposition of the charge transfer term, which measures the number of millielectrons that move between fragments:
In [19]:
def get_eda_fragment_delocalization_millielectrons(outputfilepath, nfragments):
    """Given a path to an output file and the number of fragments it contains,
    return the number of millielectrons that delocalize between fragments.
    """
    fragment_del_millielectrons = np.empty(shape=(nfragments, nfragments))
    searchstr = "Delocalization from fragment(row) to fragment(col)"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr in line:
                line = next(outputfile)
                line = next(outputfile)
                line = next(outputfile)
                row_counter = 0
                while list(set(line.strip())) != ['-']:
                    sline = list(map(float, line.split()[1:]))
                    fragment_del_millielectrons[row_counter, :] = sline
                    line = next(outputfile)
                    row_counter += 1
    return fragment_del_millielectrons
In [20]:
fragment_del_millielectrons = get_eda_fragment_delocalization_millielectrons(outputfilepath, nfragments)
print(fragment_del_millielectrons)
The easier we make it to reuse our code for new calculations, the faster we get to analysis and thinking about our data.
Since we're "delocalizing" from row to column, we should be able to get the total number of millielectrons donated by each fragment as the sum over all columns for each row. To get the total number of millielectrons accepted by a fragment, we can take the sum over all rows for a given column.
For this particular calculation, fragment 1 is a combined anion/cation ionic liquid pair, and fragment 2 is CO$_2$. Knowing this, we probably expect more charge to shift from the ionic liquid onto the CO$_2$, though it's hard to say that conclusively since the anion can just delocalize onto the cation (the whole fragment is of course charge neutral). So, it shouldn't be too surprising if the numbers aren't very different.
In [21]:
me_donated_by_il = np.sum(fragment_del_millielectrons[0, :])
me_donated_by_co2 = np.sum(fragment_del_millielectrons[1, :])
print(me_donated_by_il, me_donated_by_co2)
There's a net donation of charge density from the ionic liquid onto the CO$_2$, as expected.
What about charge accepted?
In [22]:
me_accepted_by_il = np.sum(fragment_del_millielectrons[:, 0])
me_accepted_by_co2 = np.sum(fragment_del_millielectrons[:, 1])
print(me_accepted_by_il, me_accepted_by_co2)
The values are almost exactly the opposite of the charge donation values. Why aren't they exactly the same?
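Without answering the chemistry, a toy array (values made up for illustration) shows which matrix elements the difference comes from: the diagonal element appears in both the row sum and the column sum, so the discrepancy is exactly the asymmetry of the off-diagonal pair.

```python
import numpy as np

# A made-up 2x2 delocalization matrix with asymmetric off-diagonal elements.
m = np.array([[0.0, 2.418],
              [1.899, 0.0]])
donated_by_1 = np.sum(m[0, :])   # row sum: diagonal + m[0, 1]
accepted_by_1 = np.sum(m[:, 0])  # column sum: diagonal + m[1, 0]
# The difference between "donated" and "accepted" is m[0, 1] - m[1, 0].
print(donated_by_1 - accepted_by_1)
```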
There's an additional section of output that can be requested when performing calculations with only two fragments: complementary occupied-virtual pairs (COVPs) can be formed, which allow a direct assignment between a donor orbital on one fragment and an acceptor orbital on the other. The charge transferred through the COVPs in each direction is reported both as an energy and as a number of millielectrons.
---------------------------------------------------------------------
* Complementary occupied-virtual pairs *
* Delta E, kJ/mol; Delta Q, me- *
* No BSSE correction *
---------------------------------------------------------------------
From fragment 1 to fragment 2
---------------------------------------------------------------------
# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)
---------------------------------------------------------------------
1 -3.1119( 69.1%) -3.1119( 69.1%) 1.805( 74.7%) 1.805( 74.7%)
2 -0.9232( 20.5%) -0.9232( 20.5%) 0.415( 17.2%) 0.415( 17.2%)
3 -0.2344( 5.2%) -0.2344( 5.2%) 0.119( 4.9%) 0.119( 4.9%)
4 -0.0771( 1.7%) -0.0771( 1.7%) 0.034( 1.4%) 0.034( 1.4%)
5 -0.0536( 1.2%) -0.0536( 1.2%) 0.016( 0.7%) 0.016( 0.7%)
6 -0.0324( 0.7%) -0.0324( 0.7%) 0.010( 0.4%) 0.010( 0.4%)
7 -0.0245( 0.5%) -0.0245( 0.5%) 0.009( 0.4%) 0.009( 0.4%)
8 -0.0197( 0.4%) -0.0197( 0.4%) 0.005( 0.2%) 0.005( 0.2%)
9 -0.0111( 0.2%) -0.0111( 0.2%) 0.003( 0.1%) 0.003( 0.1%)
10 -0.0104( 0.2%) -0.0104( 0.2%) 0.002( 0.1%) 0.002( 0.1%)
11 -0.0023( 0.1%) -0.0023( 0.1%) 0.001( 0.0%) 0.001( 0.0%)
12 -0.0011( 0.0%) -0.0011( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
13 -0.0011( 0.0%) -0.0011( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
14 -0.0009( 0.0%) -0.0009( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
15 -0.0005( 0.0%) -0.0005( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
16 -0.0004( 0.0%) -0.0004( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
17 -0.0003( 0.0%) -0.0003( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
18 -0.0001( 0.0%) -0.0001( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
19 -0.0001( 0.0%) -0.0001( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
20 -0.0001( 0.0%) -0.0001( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
21 -0.0001( 0.0%) -0.0001( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
22 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
23 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
24 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
25 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
26 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
27 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
28 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
29 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
30 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
31 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
32 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
33 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
34 -0.0000( 0.0%) -0.0000( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
---------------------------------------------------------------------
Tot -4.5052(100.0%) -4.5052(100.0%) 2.418(100.0%) 2.418(100.0%)
---------------------------------------------------------------------
From fragment 2 to fragment 1
---------------------------------------------------------------------
# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)
---------------------------------------------------------------------
1 -2.2084( 72.4%) -2.2084( 72.4%) 1.532( 80.7%) 1.532( 80.7%)
2 -0.3802( 12.5%) -0.3802( 12.5%) 0.182( 9.6%) 0.182( 9.6%)
3 -0.2128( 7.0%) -0.2128( 7.0%) 0.082( 4.3%) 0.082( 4.3%)
4 -0.1511( 5.0%) -0.1511( 5.0%) 0.070( 3.7%) 0.070( 3.7%)
5 -0.0526( 1.7%) -0.0526( 1.7%) 0.020( 1.1%) 0.020( 1.1%)
6 -0.0337( 1.1%) -0.0337( 1.1%) 0.010( 0.5%) 0.010( 0.5%)
7 -0.0053( 0.2%) -0.0053( 0.2%) 0.001( 0.0%) 0.001( 0.0%)
8 -0.0027( 0.1%) -0.0027( 0.1%) 0.000( 0.0%) 0.000( 0.0%)
9 -0.0011( 0.0%) -0.0011( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
10 -0.0003( 0.0%) -0.0003( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
11 -0.0002( 0.0%) -0.0002( 0.0%) 0.000( 0.0%) 0.000( 0.0%)
---------------------------------------------------------------------
Tot -3.0482(100.0%) -3.0482(100.0%) 1.899(100.0%) 1.899(100.0%)
---------------------------------------------------------------------
The most interesting values are the totals from each fragment to the other. Both the energy and the number of millielectrons would be good to have. There are two columns for each, one for alpha spin and one for beta; since we're using a spin-restricted wavefunction, they're identical, and we only care about one spin.
It's been determined that the "target" lines containing the numbers we want are
Tot -4.5052(100.0%) -4.5052(100.0%) 2.418(100.0%) 2.418(100.0%)
Tot -3.0482(100.0%) -3.0482(100.0%) 1.899(100.0%) 1.899(100.0%)
but really just
(-4.5052, 2.418)
(-3.0482, 1.899)
so what text can we search for? # Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta) is a good choice; it isn't unique within the entire block, but it only appears inside this block, and it clearly starts each section. We can also search for Tot.
In [23]:
searchstr = "# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            print(line, end='')
Now use the trick where we stick a while loop inside the if statement and call the outputfile iterator until we hit Tot:
In [24]:
searchstr = "# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            # Do an exact character match on the string.
            while line[:4] != " Tot":
                line = next(outputfile)
            print(line, end='')
All that each line requires is a bit of manipulation: split, take the [1::2] entries (quiz: what does this do?), get rid of the percentage stuff, map the values to floats, and return them as tuples. There's a problem though: how can we uniquely return both tuples? We could append every match to a list and return the list, but I'd rather be more explicit here since we're only dealing with two lines.
In [25]:
searchstr = "# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            while line[:4] != " Tot":
                line = next(outputfile)
            print(line, end='')
            line = next(outputfile)
            while line[:4] != " Tot":
                line = next(outputfile)
            print(line, end='')
This isn't a good idea for more complicated cases (for example, it won't work if Tot is on two consecutive lines), but it works more often than not.
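For those more complicated cases, the append-every-match approach mentioned earlier is more robust: it collects however many Tot lines appear, including consecutive ones. A minimal sketch, run on a couple of hardcoded sample lines (copied from the COVP block above) instead of the file:

```python
# Collect every "Tot" line into a list, however many there are.
sample_lines = [
    " Tot    -4.5052(100.0%)   -4.5052(100.0%)     2.418(100.0%)    2.418(100.0%)",
    " ---------------------------------------------------------------------",
    " Tot    -3.0482(100.0%)   -3.0482(100.0%)     1.899(100.0%)    1.899(100.0%)",
]
totals = []
for line in sample_lines:
    if line[:4] == " Tot":
        # Same manipulation as below: split, take [1::2], drop the
        # 8-character "(100.0%)" suffix, and convert to float.
        totals.append(tuple(float(x[:-8]) for x in line.split()[1::2]))
print(totals)  # [(-4.5052, 2.418), (-3.0482, 1.899)]
```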
The lines that we just print to the screen can now be manipulated and assigned to unique variables:
In [26]:
searchstr = "# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            while line[:4] != " Tot":
                line = next(outputfile)
            f_1_to_2 = tuple([float(x[:-8]) for x in line.split()[1::2]])
            print(f_1_to_2)
            line = next(outputfile)
            while line[:4] != " Tot":
                line = next(outputfile)
            f_2_to_1 = tuple([float(x[:-8]) for x in line.split()[1::2]])
            print(f_2_to_1)
Notice that the list(map(float, line.split())) trick can't be used here, because we aren't just doing a type conversion on each element; we also need a slicing operation. We could do the slicing with a map and an anonymous function too, but it doesn't look as nice:
In [27]:
searchstr = "# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            while line[:4] != " Tot":
                line = next(outputfile)
            f_1_to_2 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
            print(f_1_to_2)
            line = next(outputfile)
            while line[:4] != " Tot":
                line = next(outputfile)
            f_2_to_1 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
            print(f_2_to_1)
Maybe it looks fine; if you've never used an anonymous function before it can be a bit odd. I just tend to write the former with the explicit list comprehension.
Now turn it into a function:
In [28]:
def get_eda_covp_totals(outputfilepath):
    """Given a path to an output file, return the totals for each fragment from the COVP analysis.
    The first element of the tuple is the energy contribution, the second element is the
    number of millielectrons transferred."""
    searchstr = "# Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta)"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr in line:
                while line[:4] != " Tot":
                    line = next(outputfile)
                f_1_to_2 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
                line = next(outputfile)
                while line[:4] != " Tot":
                    line = next(outputfile)
                f_2_to_1 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
    return f_1_to_2, f_2_to_1
In [29]:
f_1_to_2, f_2_to_1 = get_eda_covp_totals(outputfilepath)
print(f_1_to_2)
print(f_2_to_1)
What are some of the advantages of these approaches (searching for a unique marker line, calling next() a fixed number of times, parsing until a terminating line, and so on)?